Frontiers in Artificial Intelligence — Latest Matching Preprints

1

The Independence of Discrimination and Calibration in Clinical Risk Prediction: Lessons from a Multi-Timeframe Diabetes Prediction Framework

OReilly, E.; Kurakovas, T.

2026-02-14 health informatics 10.64898/2026.02.12.26346147 medRxiv

Top 0.1%

13.8%

Background: Clinical risk prediction models are typically evaluated by discrimination (area under the receiver operating characteristic curve, AUC), with calibration receiving less attention. We developed a multi-timeframe diabetes prediction framework emphasizing calibration and used synthetic data validation to investigate whether good discrimination guarantees good calibration. Methods: We generated 500,000 synthetic patients using published epidemiological parameters from QDiabetes-2018, FINDRISC, and the Diabetes Prevention Program. The framework comprises a discrete-time survival ensemble with isotonic calibration, producing predictions at 6, 12, 24, and 36 months with bootstrap confidence intervals. We evaluated discrimination (AUC), bin-level calibration (expected calibration error, ECE), calibration-in-the-large (observed-to-expected ratio), and clinical utility (decision curve analysis). We compared performance against QDiabetes-2018 implemented on the same synthetic cohort. Results: Despite achieving excellent discrimination (AUC = 0.844, 95% CI: 0.840-0.848) and low bin-level calibration error (ECE = 0.006), the framework systematically overpredicted risk by 50%: mean predicted probability was 8.4% versus an observed rate of 5.6% (observed-to-expected ratio = 0.66, 95% CI: 0.65-0.67). This miscalibration occurred despite isotonic regression on a held-out calibration set. Overprediction was present in 9 of 10 risk deciles. Risk stratification remained valid (23.5-fold separation, 95% CI: 22.8-24.3, between highest and lowest tiers), confirming that discrimination was preserved. QDiabetes-2018 achieved comparable discrimination (AUC = 0.831) with better calibration (O:E = 0.89). Decision curve analysis showed net benefit across the threshold range 5-30%, though recalibration would improve clinical utility. Conclusions: Good discrimination does not guarantee good calibration. 
Our primary finding is negative: isotonic calibration failed to produce well-calibrated predictions even on synthetic data from a single generator. This has important implications for model deployment, where distribution shift is inevitable. We recommend that prediction model studies report calibration-in-the-large alongside bin-level metrics, as ECE alone can be misleading when risk distributions are skewed. Recalibration on deployment populations will likely be necessary for any prediction model, regardless of development-phase calibration performance.

Key Messages

What is already known
- Clinical prediction models require both discrimination (ranking patients correctly) and calibration (accurate probability estimates)
- Isotonic regression is a recommended approach for post-hoc calibration
- Expected calibration error (ECE) is commonly reported as a summary calibration metric

What this study adds
- Demonstrates empirically that excellent discrimination (AUC = 0.844) can coexist with substantial miscalibration (50% overprediction)
- Shows that low ECE can be misleading when most patients fall in low-risk deciles
- Provides evidence that isotonic calibration on held-out data may not generalize even within synthetic data from one generator
- Demonstrates a discrete-time survival architecture that reduces monotonicity violations to <0.1%

How this study might affect research, practice, or policy
- Prediction model studies should report calibration-in-the-large (O:E ratio) alongside ECE
- Developers should expect recalibration to be necessary when deploying to new populations
- Claims of calibrated prediction should be viewed skeptically without comprehensive calibration assessment
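
The two calibration summaries named in this abstract, bin-level ECE and the observed-to-expected (O:E) ratio, can be sketched as follows. This is a minimal illustration, not the authors' code: the equal-width binning, the function names, and the toy data are assumptions chosen only to reproduce the reported mean predicted (8.4%) and observed (5.6%) rates.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed rate - mean prediction| over equal-width bins (assumed binning)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by the share of patients in the bin
    return float(ece)

def observed_to_expected(y_true, y_prob):
    """Calibration-in-the-large: observed event rate divided by mean predicted risk."""
    return float(np.mean(y_true) / np.mean(y_prob))

# Toy cohort (hypothetical): 56 events in 1,000 patients, every patient predicted at 8.4%
y_true = np.array([1] * 56 + [0] * 944)
y_prob = np.full(1000, 0.084)
oe_ratio = observed_to_expected(y_true, y_prob)  # about 0.67, i.e. overprediction
```

An O:E ratio below 1 means the model predicts more events than occur, which is the direction of miscalibration the abstract reports.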

2

Identification of Suicide-Related Subgroups Using Latent Class Analysis: Complementary Insights to Explainable AI-Based Classification

Kizilaslan, B.; Mehlum, L.

2026-03-27 psychiatry and clinical psychology 10.64898/2026.03.25.26349264 medRxiv

Top 0.1%

10.1%

Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with the XAI-based suicide classification model's findings suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks. 
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning

3

Classification of Adolescent Drinking via Behavioral, Biological, and Environmental Features: A Machine Learning Approach with Bias Control

Liu, R.; Azzam, M.; Zabik, N.; Wan, S.; Blackford, J.; Wang, J.

2026-02-26 addiction medicine 10.64898/2026.02.24.26347002 medRxiv

Top 0.1%

8.7%

In 2024, approximately 30% of U.S. adolescents reported having consumed alcohol at least once in their lifetime, with about 25% of these individuals engaging in binge drinking. Adolescent alcohol use is associated with neurodevelopmental impairments, elevated risk of later alcohol use, and mental health disorders. These findings underscore the importance of identifying the variables driving adolescent alcohol use and leveraging them for early identification and targeted intervention. Previous studies have typically developed machine-learning classification models that use neuroimaging data in combination with limited clinical measurements. Neuroimaging data are expensive and difficult to obtain at scale, whereas clinical measures are more practical for large-scale screening due to their low cost and widespread accessibility. However, clinical-only approaches to classifying alcohol drinking remain largely underexplored. Furthermore, prior studies have often focused on adults, limiting generalizability to the broader adolescent population. Additionally, confounding factors such as age and substance use, which are strongly correlated with alcohol consumption, have often been inadequately addressed, potentially inflating classification performance. Finally, class imbalance remains a persistent challenge, with prior attempts yielding only limited improvements. To address these limitations, we propose FocalTab, a framework that integrates TabPFN with focal loss for robust generalization and effective mitigation of class imbalance. The approach also incorporates an initial preprocessing step that removes confounding factors to account for age and substance use. We compare FocalTab against state-of-the-art methods across different variable selections and dataset settings. 
FocalTab achieves the highest accuracy (84.3%) and specificity (80.0%) in the most stringent setting, in which both age and substance use variables were excluded, whereas competing models drop to near-chance specificity (12-24%). We further applied SHapley Additive exPlanations (SHAP) analysis to identify key clinical predictors of drinker classification, supporting enhanced screening and early intervention.
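
The focal loss mentioned in this abstract can be illustrated with the standard binary form, FL = -alpha_t (1 - p_t)^gamma log(p_t). The sketch below is an assumption-laden illustration: the abstract does not say how FocalTab couples the loss to TabPFN, and the function name and the `gamma` and `alpha` defaults are not taken from the source.

```python
import numpy as np

def binary_focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss; gamma and alpha defaults are illustrative assumptions."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class
    p_t = np.where(y_true == 1, p, 1.0 - p)
    # alpha_t weights positives by alpha and negatives by (1 - alpha)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the contribution of well-classified examples
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

# An easy positive (p = 0.9) contributes far less than a hard positive (p = 0.1)
easy = binary_focal_loss([1], [0.9])
hard = binary_focal_loss([1], [0.1])
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, which is a convenient sanity check.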

4

Estimating the Smallest Worthwhile Difference (SWD) of Psychotherapy for Alcohol Use Disorder: Protocol for a Cross-Sectional Survey

Sahker, E.; Lu, I.; Eddie, D.; So, R.; Luo, Y.; Omae, K.; Tajika, A.; Angelo, J. P.; Crisp, T.; Coffin, B.; Furukawa, T. A.

2026-02-27 addiction medicine 10.64898/2026.02.16.26346220 medRxiv

Top 0.1%

8.3%

Background: Psychotherapy is proven efficacious for the treatment of alcohol use disorder (AUD). However, the patient-perceived importance of its effect is not fully appreciated in the evidence base. The smallest worthwhile difference (SWD) represents the smallest beneficial effect of an intervention that patients deem worthwhile in exchange for the harms, expenses, and inconveniences associated with the intervention, and facilitates the interpretation of the patient-perceived worthiness of an intervention. Methods: The proposed study will estimate the SWD of NIAAA-recommended psychotherapies for AUD treatment with English-speaking American respondents aged 18 and older. Primary participants will be recruited using the Prolific research crowdsourcing site. The SWD will be estimated using the Benefit-Harm Trade-off Method, presenting survey respondents with variable, hypothetical magnitudes of psychotherapy outcomes to find the smallest acceptable effect over a natural remission alternative. The overall average SWD and subgroup distributions by participant AUD treatment experiences and AUD symptomology will be described. Secondary findings will estimate the smallest recommendable risk difference for AUD psychotherapy from providers and criminal justice professionals. Expected Results: We expect to find an estimate of the SWD for AUD psychotherapy. Further, we expect that the SWD will vary between clinical subgroups based on AUD symptomology and treatment experiences. We expect differences in SWDs between the general population and those of providers and criminal justice professionals. Findings from this project will inform the treatment decision process about psychotherapy during the clinical consultation for people with AUD.

5

Class imbalance correction in artificial intelligence models leads to miscalibrated clinical predictions: a real-world evaluation

Roesler, M. W.; Wells, C.; Schamberg, G.; Gao, J.; Harrison, E.; O'Grady, G.; Varghese, C.

2026-03-05 health informatics 10.64898/2026.03.04.26347634 medRxiv

Top 0.1%

8.2%

Background: Predictive models employing machine learning algorithms are increasingly being used in clinical decision making, and improperly calibrated models can result in systematic harm. We sought to investigate the impact of class imbalance correction, a commonly applied preprocessing step in machine learning model development, on calibration and modelled clinical decision making in a large real-world context. Methods: A histogram-based gradient boosting classifier was trained on a highly imbalanced national dataset of >1.8 million patients undergoing surgery, to predict the risk of 90-day mortality and complications after surgery. Class imbalance correction strategies including random oversampling (ROS), synthetic minority oversampling technique (SMOTE), random undersampling (RUS), and cost-sensitive learning (CSL) were compared to the natural distribution (natural). Models were tested and compared with classification metrics, calibration plots, decision curve analysis, and simulated clinical impact analysis. Results: The natural model demonstrated high performance (AUROC 0.94, 95% CI 0.94-0.95 for mortality; 0.84, 95% CI 0.84-0.85 for complications) and calibration (log loss 0.05, 95% CI 0.04-0.05 for mortality; 0.23, 95% CI 0.23-0.24 for complications). Class imbalance mitigation (CSL, ROS, RUS, and SMOTE) did not improve AUROC or AUPRC but increased recall and F1 scores at the expense of precision and accuracy. However, these methods severely compromised model calibration, leading to significant over-prediction of risks (up to a 62.8% increase) as further evidenced by increased log loss across all mitigation techniques. Decision curve analysis and clinical scenario testing confirmed that the natural model provided the highest net benefit. Conclusion: Class imbalance correction methods result in significant miscalibration, leading to possible harm when used for clinical decision making.
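
One mechanism behind the over-prediction described in this abstract can be shown for random undersampling alone. If only a fraction `beta` of the majority (negative) class is kept, a perfectly calibrated true risk `p` becomes `p / (p + beta * (1 - p))` in the resampled data. The sketch below is a textbook prior-shift identity, not the authors' analysis; the function names and the example values are assumptions, and it does not cover SMOTE, oversampling, or cost-sensitive learning.

```python
def inflate_by_undersampling(p, beta):
    """Risk seen after keeping a fraction beta of negatives (0 < beta <= 1)."""
    return p / (p + beta * (1.0 - p))

def correct_for_undersampling(p_s, beta):
    """Inverse mapping: recover the original-scale risk from the resampled-scale risk."""
    return beta * p_s / (beta * p_s + 1.0 - p_s)

# Hypothetical example: a true 5% risk, with 10% of negatives retained
p_true = 0.05
p_resampled = inflate_by_undersampling(p_true, beta=0.1)   # roughly 0.34
p_recovered = correct_for_undersampling(p_resampled, beta=0.1)
```

The ranking of patients is unchanged by this transformation, which is consistent with discrimination metrics staying flat while calibration degrades.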

6

Adversarial Robustness of Capsule Networks for Medical Image Classification

Srinivasan, A.; Sritharan, D. V.; Chadha, S.; Fu, D.; Hossain, J. O.; Breuer, G. A.; Aneja, S.

2026-03-10 health informatics 10.64898/2026.03.09.26347900 medRxiv

Top 0.1%

6.7%

Purpose: Deep learning models are increasingly being used in medical diagnostics, but their vulnerability to adversarial perturbations raises concerns about their reliability in clinical applications. Capsule networks (CapsNets) are a promising architecture for medical imaging tasks, given their ability to model spatial relationships and train with smaller amounts of data. Although previous studies have focused on adversarial training approaches to improve robustness, exploring alternative architectures is an underexplored direction for combating poor adversarial stability. Prior work has suggested that CapsNets may exhibit improved robustness to adversarial perturbations compared to convolutional neural networks (CNNs), but performance on adversarial images has not been studied systematically in clinical environments. We evaluated the robustness of CapsNets compared to CNNs and vision transformers (ViTs) across multiple medical image classification tasks. Methods: We trained two CNNs (ResNet-18 and ResNet-50), one ViT (MedViT), and two CapsNets (DR-CapsNet and BP-CapsNet) on four distinct medical imaging datasets (PneumoniaMNIST, BreastMNIST, NoduleMNIST3D, and BloodMNIST) and one natural image dataset (MNIST). Models were evaluated on adversarial examples generated by projected gradient descent and fast gradient sign method across a range of perturbation bounds. Interpretability experiments, including latent space and Gradient-weighted Class Activation Mapping (Grad-CAM) analyses, were conducted to better understand model stability on adversarial inputs. Results: CapsNets demonstrated superior robustness under adversarial perturbations compared to CNNs and ViTs across all medical imaging datasets and the natural image dataset. 
Latent space and Grad-CAM visualizations revealed that CapsNets maintained more consistent embedding representations and attention maps after adversarial perturbations compared to CNNs and ViTs, suggesting that advantages in CapsNet robustness are supported, at least in part, by more stable feature encodings. Bayes-Pearson routing further improved robustness over standard dynamic routing in CapsNets without compromising baseline performance, suggesting a potential architectural improvement. Conclusion: CapsNets exhibit intrinsic advantages in adversarial robustness over CNN- and ViT-based models on medical imaging tasks, suggesting they are a reliable alternative for medical image classification. These findings support the use of CapsNets in clinical applications where model reliability is critical.
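
The fast gradient sign method named in this abstract perturbs an input by a bounded step in the direction that increases the loss: x_adv = x + epsilon * sign(grad_x loss). The sketch below applies it to a small logistic-regression classifier as a stand-in, since the study's networks are not available here; the weights, input, and `epsilon` are hypothetical values for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logistic(x, y, w, b, epsilon):
    """One-step FGSM against a logistic-regression model (stand-in for a deep network)."""
    p = sigmoid(x @ w + b)
    # For binary cross-entropy, the gradient with respect to the input is (p - y) * w
    grad_x = (p - y) * w
    # Each feature moves by exactly epsilon, so the perturbation is L-infinity bounded
    return x + epsilon * np.sign(grad_x)

# Hypothetical model and input with true label 1
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])
y = 1.0

x_adv = fgsm_logistic(x, y, w, b, epsilon=0.3)
p_clean = sigmoid(x @ w + b)
p_adv = sigmoid(x_adv @ w + b)  # lower confidence in the true class than p_clean
```

Projected gradient descent, the other attack the abstract lists, repeats steps of this kind and projects back into the allowed perturbation bound after each one.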

7

Interpretability as stability under perturbation reveals systematic inconsistencies in feature attribution

Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.

2026-04-22 health informatics 10.64898/2026.04.20.26351354 medRxiv

Top 0.1%

6.5%

Interpreting machine learning models typically relies on feature attribution methods that quantify the contribution of individual variables to model predictions. However, it remains unclear whether attribution magnitude reflects the true functional importance of features for model performance. Here, we present a unified interpretability framework integrating permutation-based attribution, feature ablation, and stability under perturbation across multiple feature spaces. Using nested cross-validation and permutation-based null diagnostics, we systematically evaluate the relationship between attribution magnitude and functional dependence in clinical and biomarker-based prediction models. Attribution magnitude is frequently misaligned with functional importance, with weak to strong negative correlations observed across feature spaces (Spearman ρ ranging from -0.374 to -0.917). Features with high attribution often have limited impact on model performance when removed, whereas features with low attribution can be essential for maintaining predictive accuracy. These discrepancies define distinct classes of interpretability failure, including attribution excess and latent dependence. Interpretability further depends on feature space composition, and stable, functionally relevant features are not necessarily those with the highest attribution scores. By integrating attribution, functional impact, and stability into a composite Feature Reliability Score, we identify features that remain informative across perturbations and analytical contexts. These findings indicate that interpretability does not arise from attribution magnitude alone but is better characterized by stability under perturbation. This framework provides a basis for more robust model interpretation and highlights limitations of attribution-centric approaches in high-dimensional and correlated data settings.

8

Biodesign Buddy: Integrating Generative Artificial Intelligence in Academic Biodesign

Riffle, D.; Rubery, P.

2026-03-13 scientific communication and education 10.64898/2026.03.11.710906 medRxiv

Top 0.1%

6.4%

Biodesign is an interdisciplinary research domain that incorporates principles from design and the life sciences to develop new systems, processes, and objects. Collegiate biodesign educators face unique pedagogical challenges, including an absence of relevant scholarship on curriculum design and instructional best practices for cultivating student scientific literacy. These difficulties may be overcome with newly available technologies, like generative AI systems, that enable personalized learning through domain-specific semantic spaces. This article examines the instructional value of one such domain-specific LLM, Biodesign Buddy, through a mixed-methods analysis of an eight-week study involving 64 students participating in an international biodesign competition. Results indicate strong support for integrating AI into biodesign coursework. Surveys captured attitudes toward AI, scientific literature, and learning experiences to assess AI's impact on learning outcomes. Findings suggest that integrating AI into biodesign pedagogy can meaningfully redress conceptual issues in biodesign while informing broader debates on AI's role in higher education. Impact Statement: This article introduces Biodesign Buddy, a domain-specific generative AI system for collegiate biodesign education, and reports on its exploratory deployment, offering design principles and preliminary findings to inform the development of AI-supported pedagogies for interdisciplinary biodesign instruction.

9

Cannabis Use Documentation within the Electronic Health Record: A Use Case for Natural Language Processing Methods

Pradhan, A. M.; Shetty, V. A.; Gregor, C.; Graham, J. H.; Tusing, L.; Hirsch, A. G.; Hall, E.; Troiani, V.; Davis, M. P.; Bieler, D. L.; Romagnoli, K. M.; Kraus, C. K.; Piper, B. J.; Wright, E. A.

2026-03-02 addiction medicine 10.64898/2026.02.27.26347207 medRxiv

Top 0.1%

6.2%

Introduction: Recreational and medical cannabis use (CU) information is often available within the electronic health record (EHR) in a format that is impractical for health care provider use. Transformation of free-text EHR documentation in notes to discrete elements is possible using natural language processing (NLP) and has the potential to characterize CU efficiently. The objective of this study was to develop an NLP algorithm to identify documentation of CU within EHR unstructured clinical notes. Methods: We identified EHR notes with cannabis-related terminologies through a keyword search among all Geisinger patients with at least one encounter between 1/1/2013 and 6/30/2022. We trained four NLP models to classify notes into six categories based on time, context, and reliability of CU documentation identified through manual annotation. We compared the demographic characteristics of patients with positive classification for CU using the best-performing model to those of the overall population. Results: Of the over 1.7 million eligible patients, 150,726 (8.6%) were flagged as cannabis users. Bio-ClinicalBERT, a transformer-based NLP model, achieved close to human performance in classifying CU (weighted Precision=91.4, Recall=93.3, F-score=92.4). Cannabis users had higher BMI and were at least nine-fold more likely to use tobacco, alcohol, and illicit substances. Conclusion: Our study evaluated the prevalence of CU documentation across the entire corpus of EHR notes data without population segmentation. The NLP methodologies used achieved performance close to that of human annotation and laid the foundation for identifying and classifying CU within unstructured data sources, with future applications in research and patient care. 
Plain Language Summary: Marijuana, also known as cannabis, may impact the health of patients, yet it is not routinely captured in medical records, and when documented, it is often found in unstructured formats (e.g., progress notes) rather than in discrete fields. Incomplete and unstructured capture limits many functional capabilities within the EHR that enhance patient care (e.g., drug interactions, notifications) and limits researchers' ability to identify patients routinely exposed to marijuana use. The transformation of free-text documentation of cannabis use (CU) into discrete elements can be performed using natural language processing (NLP). The objective of this study was to develop an NLP model to identify CU in unstructured clinical notes in the EHR. We examined the EHRs of Geisinger patients in Pennsylvania over a 10-year period. Among 1.7 million patients, 9% were identified as cannabis users. One of the NLP models tested, Bio-ClinicalBERT, achieved the highest performance. Cannabis users had a higher BMI and were ten-fold more likely to be tobacco users, ten-fold more likely to use alcohol, and nine-fold more likely to use illicit substances. NLP can be used to better understand the risks and benefits of CU at a population level and may improve patient identification to assist clinical decision-making. Future CU epidemiological research should continue to explore other avenues to automate and improve CU documentation by leveraging rapidly evolving technologies, such as artificial intelligence-driven tools.

10

Development of Explainable Machine Learning Framework for Early Detection and Risk Stratification of Diabetes in Age Specific Variations

Lukhele, N.; Mostafa, F.

2026-04-27 health informatics 10.64898/2026.04.25.26351733 medRxiv

Top 0.1%

4.8%

Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods: A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV with 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall and the F1-score. Area under the receiver operating characteristic curve (AUC) was calculated for the best generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs >35 years) and gender. SHAP was applied for model interpretability. Results: Ensemble methods demonstrated superior performance in comparison to linear or single-tree approaches, with Gradient Boosting showing the most stable generalization with a test accuracy of 0.981 and stable cross-validation accuracy of 0.972. AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic) and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged >35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion: This study presents an ML framework integrating age-stratified modelling with explainable ML methods to improve interpretability. 
The findings offer clinically relevant results that can support clinical decision-making systems, individualized risk assessment, and potential applications for targeted intervention in diabetes progression.

11

CardioPulmoNet: Modeling Cardiopulmonary Dynamics for Histopathological Diagnosis

Pham, T. D.

2026-02-20 health informatics 10.64898/2026.02.19.26346620 medRxiv

Top 0.1%

4.3%

Objective: This study investigates whether incorporating physiological coupling concepts into neural network design can support stable and interpretable feature learning for histopathological image classification under limited data conditions. Methods: A physiologically inspired architecture, termed CardioPulmoNet, is introduced to model interacting feature streams analogous to pulmonary ventilation and cardiac perfusion. Local and global tissue features are integrated through bidirectional multi-head attention, while a homeostatic regularization term encourages balanced information exchange between streams. The model was evaluated on three histopathological datasets involving oral squamous cell carcinoma, oral submucous fibrosis, and heart failure. In addition to end-to-end training, learned representations were assessed using linear support vector machines to examine feature separability. Results: CardioPulmoNet achieved performance comparable to several pretrained convolutional neural networks across the evaluated datasets. When combined with a linear classifier, improved classification performance and higher area under the receiver operating characteristic curve were observed, suggesting that the learned feature embeddings are well structured for downstream discrimination. Conclusion: These results indicate that physiologically motivated architectural constraints may contribute to stable and discriminative representation learning in computational pathology, particularly when training data are limited. The proposed framework provides a step toward integrating physiological modeling principles into medical image analysis and may support future development of transferable and interpretable learning systems for histopathological diagnosis.

12

Dynamic and Baseline Multi-Task Learning for Predicting Substance Use Initiation in the ABCD Study

Wei, M.; Zhang, H.; Peng, Q.

2026-04-13 addiction medicine 10.64898/2026.04.10.26350655 medRxiv

Top 0.1%

4.2%

Background: Early initiation of substance use is linked to later adverse outcomes, and risk factors come from multiple domains and are shared across substances. In our previous work, traditional time-to-event Cox models identified individual risk factors, but these models are not designed to jointly model multiple outcomes or capture complex non-linear relationships. Multi-task learning (MTL) can leverage shared structure across related outcomes to improve prediction and distinguish common versus substance-specific predictors. However, most MTL studies rely on baseline features and focus on single outcomes, which limits their ability to capture shared risk and temporal changes. Substance use initiation is a time-dependent process that unfolds during development and reflects changing exposures over time. Baseline-only models cannot capture these changes or represent risk dynamics. Discrete-time modeling provides a practical approach by estimating interval-level initiation risk and combining it into cumulative risk at the subject level. By integrating multi-task learning with dynamic modeling, it is possible to share information across outcomes while capturing how risk evolves over time, which may improve prediction performance. Methods: Using the Adolescent Brain Cognitive Development (ABCD) Study (release 5.1), we developed two complementary MTL frameworks to predict initiation of alcohol, nicotine, cannabis, and any substance use. A baseline MTL model predicted fixed-horizon (48-month) initiation using one record per participant, while a dynamic discrete-time MTL model incorporated longitudinal interval data to model time-varying risk. Both models used multi-domain environmental exposures, core covariates, and polygenic risk scores (PRS). Performance was evaluated on a held-out test set using AUROC, PR-AUC, and calibration metrics, and compared with single-task logistic regression (LR). 
Feature importance was assessed using permutation importance and compared with Cox proportional hazards models. Results: MTL showed comparable or improved performance relative to LR, with larger gains for low-prevalence outcomes (cannabis and nicotine). Incorporating longitudinal information led to consistent improvements across all outcomes. Dynamic models increased AUROC by +0.044 to +0.062 for MTL and +0.050 to +0.084 for LR, indicating that temporal information was the primary driver of performance gains. Feature importance analyses showed modest overlap across methods, with higher agreement between dynamic MTL and Cox models than static MTL. A small set of features, including externalizing behavior, parental monitoring, and developmental factors, were consistently identified across all approaches. Conclusions: Dynamic multi-task learning improves the prediction of substance use initiation by leveraging longitudinal structure and shared information across outcomes. While MTL provides additional gains, incorporating time-varying information is the dominant factor for improving performance. Combining baseline and dynamic frameworks offers a comprehensive strategy for identifying robust risk factors and modeling adolescent substance use initiation.
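
The step this abstract describes as combining interval-level initiation risk into subject-level cumulative risk is commonly done with the discrete-time survival identity: cumulative risk by interval t equals 1 minus the product of (1 - hazard) over intervals up to t. The sketch below assumes that identity; the abstract does not state the exact combination rule the authors used, and the hazard values are hypothetical.

```python
import numpy as np

def cumulative_risk(interval_hazards):
    """Cumulative event probability after each interval, from per-interval hazards."""
    h = np.asarray(interval_hazards, dtype=float)
    # Probability of remaining event-free through each interval
    survival = np.cumprod(1.0 - h)
    return 1.0 - survival

# Hypothetical per-interval initiation hazards for one participant
hazards = [0.02, 0.03, 0.05, 0.04]
risk_by_interval = cumulative_risk(hazards)
final_risk = float(risk_by_interval[-1])  # risk over the full follow-up horizon
```

Because each factor (1 - hazard) is at most 1, the cumulative risk can only stay flat or rise over successive intervals, which keeps multi-horizon predictions monotone.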

13

How Agent Role Structure Alters Operating Characteristics of Large Language Model Clinical Classifiers: A Comparative Study of Specialist and Deliberative Multi-Agent Protocols

Anderson, C. G.

2026-02-24 health informatics 10.64898/2026.02.22.26346818 medRxiv

Top 0.1%

4.2%

Large language models (LLMs) are increasingly deployed in structured clinical decision support, yet the architectural effects of internal role decomposition within multi-agent systems remain poorly isolated. Prior comparisons of single-agent and multi-agent prompting frequently confound workflow structure with changes in model configuration, training, or decoding. We present a controlled architectural study of role-structured inference under fixed model parameters, isolating internal role decomposition as the sole manipulated variable. Two deterministic multi-agent protocols, Generic Deliberative (GD) and Feature-Specialist (FS), are evaluated under identical base weights, decoding settings, computational budget, and adjudication logic. Across two tabular clinical benchmarks (UCI Cleveland Heart Disease and Pima Indians Diabetes), altering role structure alone systematically reshapes operating characteristics. On Cleveland, FS improves accuracy by 0.07 and macro-F1 by 0.06 relative to GD, while shifting the operating point toward higher specificity (+0.22) and lower sensitivity (-0.13), substantially reducing false positives. On Pima, architectural effects reverse direction: GD achieves the strongest macro performance (accuracy 0.68, macro-F1 0.64), whereas FS induces pronounced class asymmetry (recall 0.95 for the positive class and 0.27 for the negative class). These findings demonstrate that internal role decomposition functions as a structured inductive bias that can materially alter error distributions without modifying model parameters. Multi-agent prompt architecture should therefore be treated as an explicit mechanism for controlling sensitivity-specificity trade-offs in safety-sensitive LLM decision systems.

14

Comparison of foundation models and transfer learning strategies for diabetic retinopathy classification

Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.

2026-04-20 health informatics 10.64898/2026.04.17.26351092 medRxiv

Top 0.1%

4.2%

Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 achieving the best under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of the fine-tuning strategy (AUROC 0.70-0.85), though fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. While the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need for improved calibration and comprehensive evaluation of foundation models, both of which are essential in clinical settings.
Author summary: Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, with a focus not only on accuracy but also on how reliably they estimate disease risk. In this study, we compared different types of foundation models using two independent datasets from Brazil. We found that while these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated. In other words, the estimated probabilities did not consistently reflect the true likelihood of disease.
We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.
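The gap between discrimination and calibration described in this abstract can be made concrete with a small sketch. The abstract does not state which calibration metric the authors used; expected calibration error (ECE) with equal-width bins is assumed here purely for illustration, and the function name and toy data are hypothetical.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean predicted risk|.

    Illustrative only; not the metric or code used in the preprint.
    """
    n = len(y_true)
    if n == 0 or n != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and equal in length")

    # Assign each prediction to a bin; p == 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece


# Toy data (hypothetical): a model that ranks cases well but overestimates risk.
labels = [1, 0, 0, 0]
overestimated = [0.8, 0.8, 0.8, 0.8]
print(round(expected_calibration_error(labels, overestimated), 3))
```

A model like the toy one above can still rank cases correctly while its probabilities sit far from the observed event rate, which is the pattern the abstract reports.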

15

The Risk Factors, Detection and Classification of Esophageal Cancer Using Ensemble Machine Learning Models

Gaso, M. S.; Mekuria, R. R.; Cankurt, S.; Deybasso, H. A.; Abdo, A. A.; Abbas, G. H.

2026-03-11 health informatics 10.64898/2026.03.09.26347944 medRxiv

Top 0.1%

4.1%

Esophageal cancer (EC) remains one of the most lethal malignancies worldwide, with poor survival outcomes largely attributable to late-stage diagnosis and limited treatment effectiveness. Early detection and accurate risk stratification are therefore essential for improving clinical management. In this study, we investigate the predictive value of socio-demographic, dietary, behavioral, environmental, and clinical variables collected from 312 individuals (104 EC cases and 208 controls) in the Arsi Zone, Ethiopia. An ensemble feature-ranking approach based on Random Forest machine learning was first applied to identify the most relevant predictive features. Subsequently, multiple machine learning models were evaluated, including Histogram-based Gradient Boosting (Model I), Extreme Gradient Boosting (Model II), AdaBoost (Model III), Random Forest (Model IV), and k-Nearest Neighbors (Model V). These models were tested under multiple experimental settings using both full and reduced feature subsets. To enhance robustness and minimize variability, a multi-seed ensemble framework was employed. Different seed values generate distinct train-test splits and slight variations in model initialization and optimization, leading to minor differences in training outcomes; aggregating results across multiple seeds mitigates this variability and provides more stable and reliable performance estimates. The experimental results demonstrate that boosting-based ensemble models consistently outperform other classifiers across all evaluation metrics. Model I achieved the highest overall performance, reaching an accuracy of 0.983, with precision of 0.982, recall of 0.980, and F1-score of 0.981 using the reduced feature set, while maintaining nearly identical performance with the full feature set.
Model II also showed stable and strong predictive capability, achieving accuracies of 0.963 and 0.961 for the full and reduced feature sets, respectively, with balanced precision, recall, and F1-score values. These findings indicate that feature importance-based dimensionality reduction preserves essential predictive information without compromising classification performance. Overall, the results highlight the significant predictive contribution of dietary and environmental risk factors and demonstrate that ensemble learning provides a reliable, efficient, and clinically meaningful approach for early EC detection. The proposed framework offers a promising direction for supporting diagnostic decision-making and risk stratification in resource-limited healthcare settings.
Highlights:
- Machine Learning Framework for Esophageal Cancer Classification: A robust ensemble machine learning framework was developed to classify esophageal cancer using socio-demographic, dietary, behavioral, environmental, and clinical risk factors, enabling accurate and reliable disease prediction.
- Multi-Seed Ensemble Strategy for Improved Model Stability: A novel multi-seed ensemble classification approach was implemented to reduce model variance and improve robustness by aggregating predictions across multiple randomized training and testing splits.
- Ensemble Feature Ranking for Optimal Feature Selection: An ensemble Random Forest-based feature ranking framework was designed to identify the most predictive features, ensuring stable biomarker selection and improved model interpretability.
- High Classification Performance with Reduced Feature Set: The proposed ensemble HGBC model achieved outstanding performance with 98.3% accuracy, 98.2% precision, 98.0% recall, and 98.1% F1-score using a reduced feature subset, demonstrating efficient dimensionality reduction without performance loss.
- Exceptional Discriminative Ability with Near-Perfect AUC: The ensemble HGBC model achieved an AUC of 0.994, indicating excellent discrimination between cancer and non-cancer cases and confirming its suitability for high-precision clinical decision support.
- Zero False-Negative Predictions and Maximum Diagnostic Sensitivity: The proposed model achieved zero false negatives in evaluation, resulting in 100% statistical power and perfect sensitivity, ensuring reliable detection of esophageal cancer cases.
- Identification of Key Dietary and Environmental Risk Factors: Feature importance analysis revealed that dietary habits, hot food consumption, environmental exposures, and behavioral factors are among the most significant predictors of esophageal cancer risk.
- Ensemble Learning Outperforms Traditional Machine Learning Models: Boosting-based ensemble models, particularly HGBC and XGBoost, consistently outperformed other classifiers, demonstrating superior predictive accuracy, stability, and robustness.
- Efficient and Interpretable AI Framework for Clinical Decision Support: The proposed framework balances high predictive accuracy with interpretability, making it suitable for assisting clinicians in early diagnosis and risk stratification of esophageal cancer.
- AI-Driven Solution for Resource-Constrained Healthcare Settings: The proposed ensemble machine learning approach provides an effective and scalable diagnostic support tool, particularly valuable for healthcare systems with limited resources and access to specialized medical expertise.

16

How to gain valuable insight from scarce data with Machine Learning: a post-hoc explanation tool to identify biases in biological images classification

Bolut, C.; Pacary, A.; Pieruccioni, L.; Ousset, M.; Paupert, J.; Casteilla, L.; Simoncini, D.

2026-02-20 bioinformatics 10.64898/2026.02.20.706981 medRxiv

Top 0.1%

3.8%

Machine learning (ML) models are effective at classifying images across various fields, including biology. However, their performance on biomedical images is often limited by the small size of available datasets that are constrained by the time-consuming and costly nature of experimental data collection. A review of the literature shows that many studies using biomedical images fail to follow ML best practices. This study focuses on regenerative medicine, which aims to promote tissue regeneration rather than scarring. To explore this process, we applied ML to a limited dataset of images of mouse tissues, aiming to distinguish between regenerating and scarring samples. As expected, binary classification failed to generalize to independent data. A novel SHAP-based analysis revealed that the overfitting models were based on spurious correlations, including individual mouse characteristics that aligned with the regeneration/scarring labels. The models appeared to be solving the binary classification task, but were in fact recognizing individuals. To investigate this behavior further, we examined the test set confusion matrix of a model trained to identify individual mice. We observed that, beyond individual recognition, individuals were grouped according to the time elapsed after injury (day 3 or 10) and the healing outcome (regeneration or scarring). We hypothesized that these groupings were based on relevant biological information captured by the model. To test this hypothesis, we successfully trained a model to classify images according to the time elapsed after injury (3 or 10 days), demonstrating that ML can extract relevant biological information when the task is aligned with what the data can actually support. Altogether, this study demonstrates that carefully examining explanations of a model is an effective way not only to unveil putative biases but also to extract relevant information from a limited dataset.
Author summary: Machine learning is increasingly used to analyze biomedical images, but in many experimental settings only small datasets are available, which can easily mislead powerful models. In this study, we looked at images of mouse tissues, with the goal of distinguishing healing by regeneration from healing by scarring. Although standard machine learning models appeared to perform well during training, they failed to generalize to new animals. By carefully analyzing model explanations, we found that the models were not learning biologically meaningful patterns of tissue repair, but instead were recognizing individual mice based on subtle image-specific signatures. Importantly, this same analysis revealed that the models did capture relevant biological information when the task was better aligned with the data, such as distinguishing early versus late stages of healing. Our results highlight how explanation methods can uncover hidden biases, prevent false conclusions, and help researchers extract meaningful biological insights even from limited and imperfect datasets.

17

Selectively Augmented Decision Tree for Explainable Dementia Detection

Kamalov, F.; Thabtah, F.; Peebles, D.; Ibrahim, A.

2026-02-04 health informatics 10.64898/2026.02.03.26345441 medRxiv

Top 0.1%

3.6%

Timely and accurate diagnosis of dementia remains a critical yet challenging task. Although machine learning (ML) techniques have shown considerable promise in dementia detection, their inherent complexity often results in opaque, "black-box" models that limit clinical acceptance and usability. In this paper, we propose a Selectively Augmented Decision Tree (SADT), an interpretable AI model specifically designed for dementia detection. SADT incorporates a structured three-phase pipeline consisting of feature selection, data balancing, and construction of a transparent decision tree classifier. We apply SADT to the OASIS dataset and evaluate it empirically, showing that SADT outperforms traditional ML benchmarks, validating its effectiveness. In addition to its superior performance, SADT also mirrors aspects of human decision-making in its sequential, rule-based prioritization of key features. This approach aligns with cognitive models of cue use and heuristic reasoning, making it not only clinically transparent but also psychologically aligned with how diagnostic decisions are often made in practice. SADT's strong predictive performance and interpretability grounded in human reasoning facilitate explanation and human scrutiny, and have the potential to improve both clinical decision-making and trust in AI-assisted diagnosis.

18

Perioperative Mortality Prediction Using a Bayesian Ensemble with Prevalence-Adaptive Gating

Pandey, A. K.

2026-04-06 health informatics 10.64898/2026.04.03.26350114 medRxiv

Top 0.1%

3.6%

Background: Perioperative mortality prediction in resource-limited surgical settings remains challenging due to class imbalance, missing data, and the heterogeneity of postoperative complications. Existing risk scores such as POSSUM depend on intraoperative variables and do not quantify prediction uncertainty. Methods: We developed a prevalence-adaptive Bayesian ensemble comprising three stochastic models: a classifier Variational Autoencoder (VAE, AUC=0.95), a Flipout Last Layer network (AUC=0.84), and a Monte Carlo Dropout network (AUC=0.80), trained on 697 patients (39 deaths, prevalence 5.59%) with 67 preoperative and postoperative features. Class imbalance (16.9:1) was addressed through Variational Autoencoder augmentation: two class-conditional generative VAEs produced 619 synthetic survivor and 619 synthetic death records, yielding a balanced training corpus of 1,935 samples. VAE augmentation was selected over SMOTE and random oversampling after a comparative study (F1: random oversampling 0.61 vs VAE augmentation 0.77). Validation used a held-out set of 233 patients (13 deaths, 220 survivors). A six-stage prediction pipeline incorporated weighted base risk, a three-path prevalence-adaptive gate, Shannon entropy uncertainty quantification, and rank-transform calibration. Sensitivity analysis was conducted across all six empirically derived hyperparameters. A whole-cohort death audit evaluated all 52 deaths from the complete 930-patient dataset through the deployed clinical decision support system. Statistical analysis included Kruskal-Wallis testing of entropy across triage groups, Wilson score confidence intervals for performance metrics, and Spearman rank correlation for LIME-SHAP interpretability concordance. Results: On the validation cohort the ensemble achieved complete separation (sensitivity 100%, specificity 100%, Youden J=1.000; TP=13, FP=0, TN=220, FN=0).
The whole-cohort death audit identified 36 of 52 deaths (sensitivity 69.2%, 95% CI 55.7%-80.1%; precision 100%, 95% CI 90.4%-100.0%; F1=0.818, bootstrap 95% CI 0.732-0.894). Shannon entropy differed significantly across triage levels (Kruskal-Wallis H(2)=24.212, p<0.001, ε²=0.453), confirming a monotone gradient SAFE < CRITICAL < GRAY ZONE. All six hyperparameters were invariant across their tested ranges (J=1.000 throughout; Supplementary Tables S1-S2). LIME and SHAP rankings showed statistically significant concordance (Spearman ρ=0.440, p=0.024; Kendall τ=0.357, p=0.011), with 4 of 6 principal mortality determinants shared across both methods. Conclusions: A prevalence-adaptive Bayesian ensemble with entropy-based uncertainty triage achieves zero false positive alerts and clinically meaningful audit sensitivity in perioperative mortality prediction. Complete hyperparameter invariance confirms that reported performance reflects structural properties of the calibration architecture. The 16 missed deaths represent feature-invisible cases beyond current observational feature capacity.
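The Wilson score intervals quoted for the death audit can be checked by hand. The sketch below uses the standard Wilson formula with the counts given in the abstract (36 of 52 deaths detected, 36 of 36 alerts correct); the function name is hypothetical and this is not the authors' code.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95% level)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Audit sensitivity: 36 of 52 deaths identified.
sens_lo, sens_hi = wilson_interval(36, 52)
print(f"sensitivity 95% CI: {sens_lo:.1%} - {sens_hi:.1%}")

# Audit precision: 36 true positives among 36 alerts (no false positives).
prec_lo, prec_hi = wilson_interval(36, 36)
print(f"precision 95% CI: {prec_lo:.1%} - {prec_hi:.1%}")
```

Under these assumptions the computed bounds agree with the reported 55.7%-80.1% and 90.4%-100.0% to one decimal place, which suggests the abstract's intervals were computed on the 52-death and 36-alert denominators.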

19

Outcome Risk Modeling for Disability-Free Longevity: Comparison of Random Forest and Random Survival Forest Methods

Vanghelof, J. C.; Tzimas, G.; Du, L.; Tchoua, R.; Shah, R. C.

2026-02-17 health informatics 10.64898/2026.02.13.26346264 medRxiv

Top 0.1%

3.5%

Background: When creating risk prediction models for time-to-event data, methods that incorporate time are typically used. Random survival forests (RSF), an extension of random forests (RF), are one such class of models. We compared RSF to RF in the context of time-to-event outcomes in the ASPirin in Reducing Events in the Elderly (ASPREE) randomized controlled trial. We hypothesized that RSF would have superior discrimination and calibration versus RF. Methods: Participants from ASPREE residing outside the US or with missing data were excluded. A total of 2,291 participants were assigned 1:1 into training and test sets. RF and RSF models were trained using a total of 115 measures as candidate predictors. The outcome of interest was the earliest of incident dementia, physical disability, or death. Results: The primary endpoint occurred in 10.5% of participants. Discrimination was similar between the models: sensitivity (~0.75), specificity (~0.57), positive predictive value (~0.17), time-dependent AUC (~0.71), and Harrell's concordance (~0.73). Calibration was likewise similar, with a Brier score of ~0.09. Discussion: The RF and RSF models exhibited comparable discrimination and calibration. We conclude that RSF may not always lead to more accurate predictions of outcomes compared to RF. Further examination in different clinical trial cohorts is needed to better understand the context in which adding time into outcomes risk modeling adds value.

20

When clinical prediction models do not generalize: a simulation study in liver transplantation

Brulhart, D.; Magini, G.; Schafer, A.; Schwab, S.; Held, U.

2026-03-20 health informatics 10.64898/2026.03.19.26348780 medRxiv

Top 0.1%

3.2%

Objectives: Clinical prediction models estimate the risk of a future outcome in patients. Such models are often externally validated using independent datasets; however, even when a model has been rigorously validated in a new setting and patient population, its performance across other clinical settings remains unclear. Therefore, we systematically evaluated model performance and clinical utility across diverse patient populations to quantify the limits of transportability. Methods: Using liver transplantation as an example, we used the UK donation-after-circulatory-death (DCD) risk score and descriptive statistics from Swiss DCD liver transplant populations to simulate realistic target populations with varying donor and recipient characteristics. The risk score's ability to predict one-year graft failure was evaluated using calibration intercept, calibration slope, area under the receiver operating characteristic (ROC) curve, and net benefit. Results: The UK DCD Risk Score's performance depended heavily on the simulated population characteristics. While the score performed adequately in settings similar to those where it was derived, it was not satisfactory in others. Discussion: The study showed, using a risk score in liver transplantation as an example, that the application of a prediction model can be limited in certain external populations when they differ, and that its transportability in new settings is not guaranteed. Conclusion: This study highlights the importance of external validation of clinical prediction models to determine transportability to various target populations. Their application requires careful consideration and potential model re-estimation.
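Of the metrics listed in this abstract, net benefit is the one that ties performance to a decision threshold. A minimal sketch of the usual decision-curve definition, NB = TP/n - (FP/n) * pt/(1 - pt), is given below; the function name and toy data are hypothetical and do not come from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold.

    Standard decision-curve definition; illustrative, not the authors' code.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    if n == 0 or n != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and equal in length")

    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)

    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * (threshold / (1 - threshold))


# Toy data (hypothetical): four patients, two with the outcome.
outcomes = [1, 1, 0, 0]
risks = [0.9, 0.2, 0.8, 0.1]
for pt in (0.2, 0.5):
    print(pt, net_benefit(outcomes, risks, pt))
```

Because the false-positive penalty grows with the threshold, a score that is miscalibrated in a new population can show lower or even negative net benefit there despite a similar ROC area, which is one way transportability limits become visible.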